Towards Unsupervised Extraction of Verb Paradigms from Large Corpora

Authors

  • Cornelia H. Parkes
  • Alexander M. Malek
  • Mitchell P. Marcus
Abstract

A verb paradigm is a set of inflectional categories for a single verb lemma. To obtain verb paradigms we extracted left and right bigrams for the 400 most frequent verbs from over 100 million words of text, calculated the Kullback-Leibler distance for each pair of verbs for left and right contexts separately, and ran a hierarchical clustering algorithm for each context. Our new method for finding unsupervised cut points in the cluster trees produced results that compared favorably with results obtained using supervised methods, such as gain ratio, a revised gain ratio, and the number of correctly classified items. Left context clusters correspond to inflectional categories, and right context clusters correspond to verb lemmas. For our test data, 91.5% of the verbs are correctly classified for inflectional category, 74.7% are correctly classified for lemma, and the correct joint classification for lemma and inflectional category was obtained for 67.5% of the verbs. These results are derived only from distributional information, without use of morphological information.

* This work was supported by grants from Palladium Systems and the Glidden Company to the first author. The comments and suggestions of Martha Palmer, Hoa Trang Dang, Adwait Ratnaparkhi, Bill Woods, Lyle Ungar, and anonymous reviewers are also gratefully acknowledged.

1 Introduction

This paper presents a new, largely unsupervised method which, given a list of verbs from a corpus, will simultaneously classify the verbs by lemma and inflectional category. Our long-term research goal is to take a corpus in an unanalyzed language and to extract a grammar for the language in a matter of hours, using statistical methods with minimum input from a native speaker. Unsupervised methods avoid the labor-intensive annotation required to produce the training materials for supervised methods. The cost of annotated data becomes particularly onerous for large projects across many languages, such as machine translation. If our method ports well to other languages, it could be used to automatically create a morphological analysis tool for verbs in languages whose verb inflections have not already been thoroughly studied. Precursors to this work include (Pereira et al., 1993), (Brown et al., 1992), (Brill & Kapur, 1993), (Jelinek, 1990), and (Brill et al., 1990) and, as applied to child language acquisition, (Finch & Chater, 1992). Clustering algorithms have previously been shown to work fairly well for the classification of words into syntactic and semantic classes (Brown et al., 1992), but determining the optimum number of classes for a hierarchical cluster tree remains a difficult open problem, particularly without prior knowledge of the item classification. For semantic classifications, the correct assignment of items to classes is usually not known in advance. In these cases, only an unsupervised method, which has no prior knowledge of the item classification, can be applied. Our approach is to evaluate our new, largely unsupervised method in a domain for which the correct classification of the items is well known, namely the inflectional category and lemma of a verb. This allows us to compare the classification produced by the unsupervised method to the classifications produced by supervised methods. The supervised methods we examine are based on information content and on the number of items correctly classified. Our unsupervised method uses a single parameter, the expected size of the cluster.
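As a concrete illustration of the pipeline described in the abstract (per-verb context bigram counts, pairwise Kullback-Leibler distances, hierarchical clustering), the following Python sketch builds left-context distributions and clusters the verbs. It is only a sketch under assumptions made here for concreteness: the tokenized corpus, the add-alpha smoothing, the symmetrized form of the distance, and the average-link criterion are not details taken from the paper.

    # Sketch: left-context bigram distributions per verb, pairwise
    # (symmetrized) Kullback-Leibler distances, hierarchical clustering.
    # Smoothing constant and linkage method are illustrative assumptions.
    from collections import Counter, defaultdict
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    def left_context_distributions(tokens, verbs, vocab, alpha=0.5):
        """P(preceding word | verb) with add-alpha smoothing over a fixed vocabulary."""
        counts = defaultdict(Counter)
        for prev, word in zip(tokens, tokens[1:]):
            if word in verbs:
                counts[word][prev] += 1
        dists = {}
        for v in verbs:
            total = sum(counts[v].values()) + alpha * len(vocab)
            dists[v] = np.array([(counts[v][w] + alpha) / total for w in vocab])
        return dists

    def kl(p, q):
        # Kullback-Leibler divergence D(p || q); smoothing keeps q strictly positive.
        return float(np.sum(p * np.log(p / q)))

    def kl_distance_matrix(dists, verbs):
        # Symmetrize, since D(p || q) alone is not a symmetric distance.
        n = len(verbs)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                p, q = dists[verbs[i]], dists[verbs[j]]
                d[i, j] = d[j, i] = kl(p, q) + kl(q, p)
        return d

    # Build the left-context cluster tree; the right-context tree is analogous,
    # using the word following each verb instead of the word preceding it.
    # tree = linkage(squareform(kl_distance_matrix(dists, verbs)), method="average")

Symmetrizing the divergence is one common way to turn it into a distance suitable for clustering; since the paper does not spell out this detail, it is an assumption of the sketch.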
The classifications by inflectional category and lemma are additionally interesting because they produce trees with very different shapes. The classification tree for inflectional category has a few large clusters, while the tree for verb lemmas has many small clusters. Our unsupervised method not only performs as well as the supervised methods, but is also more robust across different shapes of the classification tree. Our results are based solely on distributional criteria and are independent of morphology. We completely ignore relations between words that are derived from spelling. We assume that any difference in form indicates a different item and have not "cleaned up" the data by removing capitalization, etc. Morphology is important for the classification of verbs, and it may well solve the problem for regular verbs. However, morphological analysis will certainly not handle highly irregular, high-frequency verbs. What is surprising is that strictly local context can make a significant contribution to the classification of both regular and irregular verbs. Distributional information is most easily extracted for high-frequency verbs, which are the verbs that tend to have irregular morphology. This work is important because it develops a methodology for analyzing distributional information in a domain that is well known. This methodology can then be applied with some confidence to other domains for which the correct classification of the items is not known in advance, for example to the problem of semantic classification.
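One possible reading of the single-parameter cut-point idea, the expected cluster size mentioned above, is to scan the flat cuts of the cluster tree and keep the one whose mean cluster size is closest to that expectation. The sketch below illustrates this heuristic; it is an assumption-laden stand-in rather than the authors' exact criterion, and the function name cut_by_expected_size is hypothetical.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster

    def cut_by_expected_size(tree, n_items, expected_size):
        """Choose the flat cut whose mean cluster size is closest to expected_size."""
        best_labels, best_gap = None, float("inf")
        for height in np.unique(tree[:, 2]):      # each merge height is a candidate threshold
            labels = fcluster(tree, t=height, criterion="distance")
            mean_size = n_items / labels.max()    # labels.max() == number of clusters
            gap = abs(mean_size - expected_size)
            if gap < best_gap:
                best_gap, best_labels = gap, labels
        return best_labels

A large expected size would favor the few broad clusters of the inflectional-category tree, while a small one would favor the many small clusters of the lemma tree, which is how a single parameter can accommodate both tree shapes.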


Similar articles

An Overview of Open Information Extraction∗

Open Information Extraction (OIE) is a recent unsupervised strategy to extract great amounts of basic propositions (verb-based triples) from massive text corpora which scales to Web-size document collections. We will introduce the main properties of this extraction method.


Dependency-Based Open Information Extraction

Building shallow semantic representations from text corpora is the first step to perform more complex tasks such as text entailment, enrichment of knowledge bases, or question answering. Open Information Extraction (OIE) is a recent unsupervised strategy to extract billions of basic assertions from massive corpora, which can be considered as being a shallow semantic representation of those corp...


Discovering Light Verb Constructions and their Translations from Parallel Corpora without Word Alignment

We propose a method for joint unsupervised discovery of multiword expressions (MWEs) and their translations from parallel corpora. First, we apply independent monolingual MWE extraction in source and target languages simultaneously. Then, we calculate translation probability, association score and distributional similarity of co-occurring pairs. Finally, we rank all translations of a given MWE ...


A Step-wise Usage-based Method for Inducing Polysemy-aware Verb Classes

We present an unsupervised method for inducing verb classes from verb uses in gigaword corpora. Our method consists of two clustering steps: verb-specific semantic frames are first induced by clustering verb uses in a corpus and then verb classes are induced by clustering these frames. By taking this step-wise approach, we can not only generate verb classes based on a massive amount of verb use...


Multilingual Open Information Extraction

Open Information Extraction (OIE) is a recent unsupervised strategy to extract great amounts of basic propositions (verb-based triples) from massive text corpora which scales to Web-size document collections. We propose a multilingual rule-based OIE method that takes as input dependency parses in the CoNLL-X format, identifies argument structures within the dependency parses, and extracts a set...




Publication date: 1998